Abstract

Embedding words in a vector space has gained a lot of attention in recent years. While state-of-the-art methods provide efficient computation of word similarities via a low-dimensional matrix embedding, their motivation is often left unclear. In this paper, we argue that word embedding can be naturally viewed as a ranking problem due to the ranking nature of the evaluation metrics. Then, based on this insight, we propose a novel framework, WordRank, that efficiently estimates word representations via robust ranking, in which the attention mechanism and robustness to noise are readily achieved via the DCG-like ranking losses. The performance of WordRank is measured on word similarity and word analogy benchmarks, and the results are compared to the state-of-the-art word embedding techniques. Our algorithm is very competitive with state-of-the-art methods on large corpora, while it outperforms them by a significant margin when the training set is limited (i.e., sparse and noisy). With 17 million tokens, WordRank performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark. Our multi-node distributed implementation of WordRank is publicly available for general usage.


1 Introduction 


Embedding words into a vector space, such that semantic and syntactic regularities between words are preserved, is an important sub-task for many applications of natural language processing. Mikolov et al. (2013a) generated considerable excitement in the machine learning and natural language processing communities by introducing a neural network based model, which they call word2vec. It was shown that word2vec produces state-of-the-art performance on both word similarity and word analogy tasks. The word similarity task is to retrieve words that are similar to a given word. On the other hand, word analogy requires answering queries of the form a:b;c:?, where a, b, and c are words from the vocabulary, and the answer to the query must be semantically related to c in the same way as b is related to a. This is best illustrated with a concrete example: given the query king:queen;man:?, we expect the model to output woman.


The impressive performance of word2vec led to a flurry of papers, which tried to explain and improve the performance of word2vec both theoretically (Arora et al., 2015) and empirically (Levy and Goldberg, 2014). One interpretation of word2vec is that it is approximately maximizing the positive pointwise mutual information (PMI), and Levy and Goldberg (2014) showed that directly optimizing this gives good results. On the other hand, Pennington et al. (2014) showed performance comparable to word2vec by using a modified matrix factorization model, which optimizes a log loss.


Somewhat surprisingly, Levy et al. (2015) showed that much of the performance gains of these new word embedding methods are due to certain hyperparameter optimizations and system-design choices. In other words, if one sets up careful experiments, then existing word embedding models more or less perform comparably to each other. We conjecture that this is because, at a high level, all these methods are based on the following template: from a large text corpus eliminate infrequent words, and compute a $|\mathcal{W}| \times |\mathcal{C}|$ word-context co-occurrence count matrix; a context is a word which appears less than $d$ distance away from a given word in the text, where $d$ is a tunable parameter. Let $w \in \mathcal{W}$ be a word and $c \in \mathcal{C}$ be a context, and let $X_{w,c}$ be the (potentially normalized) co-occurrence count. One learns a function $f(w, c)$ which approximates a transformed version of $X_{w,c}$. Different methods differ essentially in the transformation function they use and the parametric form of $f$ (Levy et al., 2015). For example, GloVe (Pennington et al., 2014) uses $f(w, c) = \langle \mathbf{u}_w, \mathbf{v}_c \rangle$, where $\mathbf{u}_w$ and $\mathbf{v}_c$ are $k$-dimensional vectors, $\langle \cdot, \cdot \rangle$ denotes the Euclidean dot product, and one approximates $f(w, c) \approx \log X_{w,c}$. On the other hand, as Levy and Goldberg (2014) show, word2vec can be seen as using the same $f(w, c)$ as GloVe but trying to approximate $f(w, c) \approx \mathrm{PMI}(X_{w,c}) - \log n$, where $\mathrm{PMI}(\cdot)$ is the pointwise mutual information (Cover and Thomas, 1991) and $n$ is the number of negative samples.

In this paper, we approach the word embedding task from a different perspective by formulating it as a ranking problem. That is, given a word $w$, we aim to output an ordered list $(c_1, c_2, \ldots)$ of context words from $\mathcal{C}$ such that words that co-occur with $w$ appear at the top of the list. If $\mathrm{rank}(w, c)$ denotes the rank of $c$ in the list, then typical ranking losses optimize the following objective: $\sum_{(w,c) \in \Omega} \rho(\mathrm{rank}(w, c))$, where $\Omega \subset \mathcal{W} \times \mathcal{C}$ is the set of word-context pairs that co-occur in the corpus, and $\rho(\cdot)$ is a ranking loss function that is monotonically increasing and concave (see Sec. 2.2 for a justification).


Casting word embedding as ranking has two distinctive advantages. First, our method is discriminative rather than generative; in other words, instead of modeling (a transformation of) $X_{w,c}$ directly, we only aim to model the relative order of the $X_{w,\cdot}$ values in each row. This formulation fits naturally to popular word embedding tasks such as word similarity/analogy since, instead of the likelihood of each word, we are interested in finding the most relevant words in a given context.^1 Second, casting word embedding as a ranking problem enables us to design models that are robust to noise (Yun et al., 2014) and that focus more on differentiating the top relevant words, a kind of attention mechanism that has proved very useful in deep learning (Larochelle and Hinton, 2010; Mnih et al., 2014; Bahdanau et al., 2015). Both issues are very critical in the domain of word embedding since (1) the co-occurrence matrix might be noisy due to grammatical errors or unconventional use of language, i.e., certain words might co-occur purely by chance, a phenomenon more acute in smaller document corpora collected from diverse sources; and (2) it is very challenging to sort out a few most relevant words from a very large vocabulary, thus some kind of attention mechanism that can trade off the resolution on most relevant words with the resolution on less relevant words is needed. We will show in the experiments that our method can mitigate some of these issues; with 17 million tokens our method performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark.

^1 Roughly speaking, this difference in viewpoint is analogous to the difference between pointwise loss functions and listwise loss functions used in ranking (Lee and Lin, 2013).


2 Word Embedding via Ranking
2.1 Notation

We use $w$ to denote a word and $c$ to denote a context. The set of all words, that is, the vocabulary, is denoted as $\mathcal{W}$ and the set of all context words is denoted $\mathcal{C}$. We will use $\Omega \subset \mathcal{W} \times \mathcal{C}$ to denote the set of all word-context pairs that were observed in the data, $\Omega_w$ to denote the set of contexts that co-occurred with a given word $w$, and similarly $\Omega_c$ to denote the words that co-occurred with a given context $c$. The size of a set is denoted as $|\cdot|$. The inner product between vectors is denoted as $\langle \cdot, \cdot \rangle$.


2.2 Ranking Model

Let $\mathbf{u}_w$ denote the $k$-dimensional embedding of a word $w$, and $\mathbf{v}_c$ denote that of a context $c$. For convenience, we collect embedding parameters for words and contexts as $\mathbf{U} := \{\mathbf{u}_w\}_{w \in \mathcal{W}}$ and $\mathbf{V} := \{\mathbf{v}_c\}_{c \in \mathcal{C}}$.

We aim to capture the relevance of context $c$ for word $w$ by the inner product between their embedding vectors, $\langle \mathbf{u}_w, \mathbf{v}_c \rangle$; the more relevant a context is, the larger we want their inner product to be.























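We achieve this by learning a ranking model that is parametrized by $\mathbf{U}$ and $\mathbf{V}$. If we sort the set of contexts $\mathcal{C}$ for a given word $w$ in terms of each context's inner product score with the word, the rank of a specific context $c$ in this list can be written as (Usunier et al., 2009):

$$\mathrm{rank}(w, c) = \sum_{c' \in \mathcal{C} \setminus \{c\}} I\left(\langle \mathbf{u}_w, \mathbf{v}_c \rangle - \langle \mathbf{u}_w, \mathbf{v}_{c'} \rangle < 0\right) = \sum_{c' \in \mathcal{C} \setminus \{c\}} I\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle < 0\right), \tag{1}$$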


where $I(x < 0)$ is a 0-1 loss function which is 1 if $x < 0$ and 0 otherwise. Since $I(x < 0)$ is a discontinuous function, we follow the popular strategy in machine learning which replaces the 0-1 loss by its convex upper bound $\ell(\cdot)$, where $\ell(\cdot)$ can be any popular loss function for binary classification such as the hinge loss $\ell(x) = \max(0, 1 - x)$ or the logistic loss $\ell(x) = \log_2(1 + 2^{-x})$ (Bartlett et al., 2006). This enables us to construct the following convex upper bound on the rank:

$$\mathrm{rank}(w, c) \le \overline{\mathrm{rank}}(w, c) = \sum_{c' \in \mathcal{C} \setminus \{c\}} \ell\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle\right). \tag{2}$$
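The relation between the exact rank (1) and its convex upper bound (2) can be checked numerically. The following is a minimal sketch in plain Python; the toy vectors and the function names (`rank`, `rank_upper`) are illustrative assumptions and are not taken from the WordRank implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logistic_loss(x):
    # log2(1 + 2^(-x)): a convex upper bound on the 0-1 indicator I(x < 0)
    return math.log2(1.0 + 2.0 ** (-x))

def hinge_loss(x):
    return max(0.0, 1.0 - x)

def rank(u_w, V, c):
    # Eq. (1): count the other contexts that score strictly higher than c
    s = dot(u_w, V[c])
    return sum(1 for cp in range(len(V)) if cp != c and s - dot(u_w, V[cp]) < 0)

def rank_upper(u_w, V, c, loss=logistic_loss):
    # Eq. (2): replace each indicator by the convex surrogate loss
    s = dot(u_w, V[c])
    return sum(loss(s - dot(u_w, V[cp])) for cp in range(len(V)) if cp != c)

# Toy data (made up for illustration): one word vector, four context vectors.
u_w = [1.0, 0.5]
V = [[2.0, 0.0], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]]
ranks = [rank(u_w, V, c) for c in range(len(V))]
bounds = [rank_upper(u_w, V, c) for c in range(len(V))]
hinge_bounds = [rank_upper(u_w, V, c, hinge_loss) for c in range(len(V))]
```

With either surrogate, every entry of the bound is at least as large as the corresponding exact rank, which is the property that (2) relies on.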


It is certainly desirable that the ranking model positions relevant contexts at the top of the list; this motivates us to write the objective function to minimize as:

$$J(\mathbf{U}, \mathbf{V}) := \sum_{w \in \mathcal{W}} \sum_{c \in \Omega_w} r_{w,c} \cdot \rho\left(\frac{\overline{\mathrm{rank}}(w, c) + \beta}{\alpha}\right), \tag{3}$$

where $r_{w,c}$ is the weight between word $w$ and context $c$ quantifying the association between them, $\rho(\cdot)$ is a monotonically increasing and concave ranking loss function that measures goodness of a rank, and $\alpha > 0$, $\beta > 0$ are the hyperparameters of the model whose role will be discussed later. Following Pennington et al. (2014), we use

$$r_{w,c} = \begin{cases} (X_{w,c} / x_{\max})^{\epsilon} & \text{if } X_{w,c} < x_{\max} \\ 1 & \text{otherwise,} \end{cases} \tag{4}$$

where we set $x_{\max} = 100$ and $\epsilon = 0.75$ in our experiments. That is, we assign larger weights (with a saturation) to contexts that appear more often with the word of interest, and vice-versa.

For the ranking loss function $\rho(\cdot)$, on the other hand, we consider the class of monotonically increasing and concave functions. While monotonicity is a natural requirement, we argue that concavity is also important so that the derivative of $\rho$ is always non-increasing; this implies that the ranking loss is the most sensitive at the top of the list (where the rank is small) and becomes less sensitive at the lower end of the list (where the rank is high). Intuitively this is desirable, because we are interested in a small number of relevant contexts which frequently co-occur with a given word, and thus are willing to tolerate errors on infrequent contexts.^2 Meanwhile, this insensitivity at the bottom of the list makes the model robust to noise in the data either due to grammatical errors or unconventional use of language. Therefore, a single ranking loss function $\rho(\cdot)$ serves two different purposes at two ends of the curve (see the example plots of $\rho$ in Figure 1): while the left hand side of the curve encourages "high resolution" on most relevant words, the right hand side becomes less sensitive (with "low resolution") to infrequent and possibly noisy words.^3 As we will demonstrate in our experiments, this is a fundamental attribute (in addition to the ranking nature) of our method that contributes to its superior performance compared to the state of the art when the training set is limited (i.e., sparse and noisy).

What are interesting loss functions that can be used for $\rho(\cdot)$? Here are four possible alternatives, all of which have a natural interpretation (see the plots of all four $\rho$ functions in Figure 1(a) and the related work in Sec. 3 for a discussion).

$$\rho_0(x) := x \quad \text{(identity)} \tag{5}$$

$$\rho_1(x) := \log_2(1 + x) \quad \text{(logarithm)} \tag{6}$$

$$\rho_2(x) := 1 - \frac{1}{\log_2(2 + x)} \quad \text{(negative DCG)} \tag{7}$$

$$\rho_3(x) := \frac{x^{1-t} - 1}{1 - t} \quad (\log_t \text{ with } t \neq 1) \tag{8}$$

^2 This is similar to the attention mechanism found in the human visual system, which is able to focus on a certain region of an image with "high resolution" while perceiving the surrounding image in "low resolution" (Larochelle and Hinton, 2010; Mnih et al., 2014).

^3 Due to the linearity of $\rho_0(x) = x$, this ranking loss doesn't have the benefit of the attention mechanism and robustness to noise since it treats all ranking errors uniformly.
Figure 1: (a) Visualizing different ranking loss functions $\rho(x)$ as defined in Eqs. (5)-(8); the lower part of $\rho_3(x)$ is truncated in order to visualize the other functions better. (b) Visualizing $\rho_1((x + \beta)/\alpha)$ with different $\alpha$ and $\beta$; $\rho_0$ is included to illustrate the dramatic scale differences between $\rho_0$ and $\rho_1$.


We will explore the performance of each of these variants in our experiments. For now, we turn our attention to efficient stochastic optimization of the objective function (3).
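To make the ingredients of objective (3) concrete, here is a small sketch of the four ranking losses (5)-(8) and the weight (4) in plain Python. The function names are illustrative assumptions; the defaults $x_{\max} = 100$, $\epsilon = 0.75$ and $t = 1.5$ are the values reported in the text. Note that `rho3` is only evaluated for $x > 0$ here, since $x^{1-t}$ is undefined at $x = 0$ when $t > 1$.

```python
import math

def rho0(x):
    return x                                   # identity, Eq. (5)

def rho1(x):
    return math.log2(1.0 + x)                  # logarithm, Eq. (6)

def rho2(x):
    return 1.0 - 1.0 / math.log2(2.0 + x)      # negative DCG, Eq. (7)

def rho3(x, t=1.5):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)  # log_t with t != 1, Eq. (8)

def weight(x_wc, x_max=100.0, eps=0.75):
    # Eq. (4): saturating weight on the co-occurrence count X_{w,c}
    return (x_wc / x_max) ** eps if x_wc < x_max else 1.0

def increments(rho, xs):
    # Discrete slope rho(x + 1) - rho(x); non-increasing for a concave rho
    return [rho(x + 1.0) - rho(x) for x in xs]

xs = [float(x) for x in range(1, 50)]
slopes = {f.__name__: increments(f, xs) for f in (rho0, rho1, rho2, rho3)}
```

For `rho1`, `rho2` and `rho3` the slopes shrink as the rank grows, which is the "low resolution at the bottom of the list" behaviour described above, whereas `rho0` has a constant slope of 1.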

2.3 Stochastic Optimization 

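Plugging (2) into (3), and replacing $\sum_{w \in \mathcal{W}} \sum_{c \in \Omega_w}$ by $\sum_{(w,c) \in \Omega}$, the objective function becomes:

$$J(\mathbf{U}, \mathbf{V}) = \sum_{(w,c) \in \Omega} r_{w,c} \cdot \rho\left(\frac{\sum_{c' \in \mathcal{C} \setminus \{c\}} \ell\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle\right) + \beta}{\alpha}\right). \tag{9}$$

This function contains summations over $\Omega$ and $\mathcal{C}$, both of which are expensive to compute for a large corpus. Although stochastic gradient descent (SGD) (Bottou and Bousquet, 2011) can be used to replace the summation over $\Omega$ by random sampling, the summation over $\mathcal{C}$ cannot be avoided unless $\rho(\cdot)$ is a linear function. To work around this problem, we propose to optimize a linearized upper bound of the objective function obtained through a first-order Taylor approximation. Observe that due to the concavity of $\rho(\cdot)$, we have

$$\rho(x) \le \rho(\xi^{-1}) + \rho'(\xi^{-1}) \cdot (x - \xi^{-1}) \tag{10}$$

for any $x$ and $\xi \neq 0$. Moreover, the bound is tight when $\xi = x^{-1}$. This motivates us to introduce a set of auxiliary parameters $\Xi := \{\xi_{w,c}\}_{(w,c) \in \Omega}$ and define the following upper bound of $J(\mathbf{U}, \mathbf{V})$:

$$J(\mathbf{U}, \mathbf{V}, \Xi) := \sum_{(w,c) \in \Omega} r_{w,c} \cdot \left\{ \rho(\xi_{w,c}^{-1}) + \rho'(\xi_{w,c}^{-1}) \cdot \left( \alpha^{-1}\beta + \alpha^{-1} \sum_{c' \in \mathcal{C} \setminus \{c\}} \ell\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle\right) - \xi_{w,c}^{-1} \right) \right\}. \tag{11}$$

Note that $J(\mathbf{U}, \mathbf{V}) \le J(\mathbf{U}, \mathbf{V}, \Xi)$ for any $\Xi$, due to (10).^4 Also, minimizing (11) yields the same $\mathbf{U}$ and $\mathbf{V}$ as minimizing (9). To see this, suppose $\hat{\mathbf{U}} := \{\hat{\mathbf{u}}_w\}_{w \in \mathcal{W}}$ and $\hat{\mathbf{V}} := \{\hat{\mathbf{v}}_c\}_{c \in \mathcal{C}}$ minimize (9). Then, by letting $\hat{\Xi} := \{\hat{\xi}_{w,c}\}_{(w,c) \in \Omega}$ where

$$\hat{\xi}_{w,c} = \frac{\alpha}{\sum_{c' \in \mathcal{C} \setminus \{c\}} \ell\left(\langle \hat{\mathbf{u}}_w, \hat{\mathbf{v}}_c - \hat{\mathbf{v}}_{c'} \rangle\right) + \beta}, \tag{12}$$

we have $J(\hat{\mathbf{U}}, \hat{\mathbf{V}}, \hat{\Xi}) = J(\hat{\mathbf{U}}, \hat{\mathbf{V}})$. Therefore, it suffices to optimize (11). However, unlike (9), (11) admits an efficient SGD algorithm. To see this, rewrite (11) as

$$J(\mathbf{U}, \mathbf{V}, \Xi) = \sum_{(w,c,c')} r_{w,c} \cdot \left( \frac{\rho(\xi_{w,c}^{-1}) + \rho'(\xi_{w,c}^{-1}) \cdot (\alpha^{-1}\beta - \xi_{w,c}^{-1})}{|\mathcal{C}| - 1} + \frac{1}{\alpha} \rho'(\xi_{w,c}^{-1}) \cdot \ell\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle\right) \right), \tag{13}$$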
where $(w, c, c') \in \Omega \times (\mathcal{C} \setminus \{c\})$. Then, it can be seen that if we sample uniformly from $(w, c) \in \Omega$ and $c' \in \mathcal{C} \setminus \{c\}$, then

$$j(w, c, c') := |\Omega| \cdot (|\mathcal{C}| - 1) \cdot r_{w,c} \cdot \left( \frac{\rho(\xi_{w,c}^{-1}) + \rho'(\xi_{w,c}^{-1}) \cdot (\alpha^{-1}\beta - \xi_{w,c}^{-1})}{|\mathcal{C}| - 1} + \frac{1}{\alpha} \rho'(\xi_{w,c}^{-1}) \cdot \ell\left(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle\right) \right), \tag{14}$$

which does not contain any expensive summations, is an unbiased estimator of (13), i.e., $\mathbb{E}[j(w, c, c')] = J(\mathbf{U}, \mathbf{V}, \Xi)$. On the other hand, one can optimize $\xi_{w,c}$ exactly by using (12). Putting everything together yields a stochastic optimization algorithm, WordRank, which can be specialized to a variety of ranking loss functions $\rho(\cdot)$ with weights $r_{w,c}$ (e.g., DCG (Discounted Cumulative Gain) (Manning et al., 2008) is one of many possible instantiations). Algorithm 1 contains detailed pseudo-code. It can be seen that the algorithm is divided into two stages: a stage that updates $(\mathbf{U}, \mathbf{V})$ and another that updates $\Xi$. Note that the time complexity of the first stage is $O(|\Omega|)$ since the cost of each update in Lines 8-10 is independent of the size of the corpus. On the other hand, the time complexity of updating $\Xi$ in Line 15 is $O(|\Omega| \, |\mathcal{C}|)$, which can be expensive. To amortize this cost, we employ two tricks: we only update $\Xi$ after a few iterations of $\mathbf{U}$ and $\mathbf{V}$ updates, and we exploit the fact that the most computationally expensive operation in (12) involves a matrix-matrix multiplication, which can be calculated efficiently via the SGEMM routine in BLAS (Dongarra et al., 1990).

^4 When $\rho = \rho_0$, one can simply set the auxiliary variables $\xi_{w,c} = 1$ because $\rho_0$ is already a linear function.
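The two stages described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it fixes $\rho = \rho_1$ and the logistic loss, stores $\xi_{w,c}$ only for observed pairs, uses dense Python lists rather than BLAS, and all names (`sgd_step`, `update_xi`) are hypothetical rather than taken from the released code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(x):
    return math.log2(1.0 + 2.0 ** (-x))       # logistic loss

def dloss(x):
    return -1.0 / (1.0 + 2.0 ** x)            # derivative of the logistic loss

def drho1(x):
    return 1.0 / ((1.0 + x) * math.log(2.0))  # derivative of rho_1 = log2(1 + x)

def sgd_step(U, V, xi, r, w, c, cp, eta):
    # Stage 1 (Lines 8-10): the three updates all read the old values.
    u, vc, vcp = U[w][:], V[c][:], V[cp][:]
    diff = [a - b for a, b in zip(vc, vcp)]
    g = r[(w, c)] * drho1(1.0 / xi[(w, c)]) * dloss(dot(u, diff))
    U[w] = [a - eta * g * d for a, d in zip(u, diff)]
    V[c] = [a - eta * g * b for a, b in zip(vc, u)]
    V[cp] = [a + eta * g * b for a, b in zip(vcp, u)]

def update_xi(U, V, xi, alpha, beta):
    # Stage 2 (Line 15), restricted here to the observed pairs stored in xi.
    for (w, c) in xi:
        s = dot(U[w], V[c])
        total = sum(loss(s - dot(U[w], V[cp])) for cp in range(len(V)) if cp != c)
        xi[(w, c)] = alpha / (total + beta)

# Toy example (made-up numbers): one word, three contexts, one observed pair.
U = [[0.1, -0.2]]
V = [[0.3, 0.1], [-0.2, 0.4], [0.0, 0.0]]
xi = {(0, 0): 1.0}
r = {(0, 0): 1.0}

def gap():
    return dot(U[0], [a - b for a, b in zip(V[0], V[1])])

gap_before = gap()
sgd_step(U, V, xi, r, w=0, c=0, cp=1, eta=0.1)
gap_after = gap()
update_xi(U, V, xi, alpha=100.0, beta=99.0)
```

After one step on the triplet $(w, c, c')$ the score margin $\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle$ grows, i.e. the observed context is pushed above the sampled one, and the subsequent $\xi$ update only needs the current surrogate rank.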


2.4 Parallelization 

The updates in Lines 8-10 have one remarkable property: to update $\mathbf{u}_w$, $\mathbf{v}_c$ and $\mathbf{v}_{c'}$, we only need to read the variables $\mathbf{u}_w$, $\mathbf{v}_c$, $\mathbf{v}_{c'}$ and $\xi_{w,c}$. What this means is that updates to another triplet of variables $\mathbf{u}_{\hat{w}}$, $\mathbf{v}_{\hat{c}}$ and $\mathbf{v}_{\hat{c}'}$ can be performed independently. This observation is the key to developing a parallel optimization strategy, by distributing the computation of the updates among multiple processors. Due to lack of space, details including pseudo-code are relegated to the supplementary material.


Algorithm 1 WordRank algorithm.

 1: $\eta$: step size
 2: repeat
 3:   // Stage 1: Update U and V
 4:   repeat
 5:     Sample $(w, c)$ uniformly from $\Omega$
 6:     Sample $c'$ uniformly from $\mathcal{C} \setminus \{c\}$
 7:     // following three updates are executed simultaneously
 8:     $\mathbf{u}_w \leftarrow \mathbf{u}_w - \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle) \cdot (\mathbf{v}_c - \mathbf{v}_{c'})$
 9:     $\mathbf{v}_c \leftarrow \mathbf{v}_c - \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle) \cdot \mathbf{u}_w$
10:     $\mathbf{v}_{c'} \leftarrow \mathbf{v}_{c'} + \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle) \cdot \mathbf{u}_w$
11:   until U and V are converged
12:   // Stage 2: Update $\Xi$
13:   for $w \in \mathcal{W}$ do
14:     for $c \in \mathcal{C}$ do
15:       $\xi_{w,c} = \alpha / \left( \sum_{c' \in \mathcal{C} \setminus \{c\}} \ell(\langle \mathbf{u}_w, \mathbf{v}_c - \mathbf{v}_{c'} \rangle) + \beta \right)$
16:     end for
17:   end for
18: until U, V and $\Xi$ are converged


2.5 Interpreting $\alpha$ and $\beta$

The update (12) indicates that $\xi_{w,c}^{-1}$ is proportional to $\overline{\mathrm{rank}}(w, c)$. On the other hand, one can observe that the loss function $\ell(\cdot)$ in (14) is weighted by a $\rho'(\xi_{w,c}^{-1})$ term. Since $\rho(\cdot)$ is concave, its gradient $\rho'(\cdot)$ is monotonically non-increasing (Rockafellar, 1970). Consequently, when $\overline{\mathrm{rank}}(w, c)$ and hence $\xi_{w,c}^{-1}$ is large, $\rho'(\xi_{w,c}^{-1})$ is small. In other words, the loss function "gives up" on contexts with high ranks in order to focus its attention on the top of the list. The rate at which the algorithm gives up is determined by the hyperparameters $\alpha$ and $\beta$. For an illustration of this effect, see the example plots of $\rho_1$ with different $\alpha$ and $\beta$ in Figure 1(b). Intuitively, $\alpha$ can be viewed as a scale parameter while $\beta$ can be viewed as an offset parameter. An equivalent interpretation is that by choosing different values of $\alpha$ and $\beta$ one can modify the behavior of the ranking loss $\rho(\cdot)$ in a problem-dependent fashion. In our experiments, we found that a common setting of $\alpha = 1$ and $\beta = 0$ often yields uncompetitive performance, while setting $\alpha = 100$ and $\beta = 99$ generally gives good results.
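The effect of $\alpha$ and $\beta$ on how quickly the loss "gives up" can be illustrated by evaluating the weighting term $\rho_1'((\mathrm{rank} + \beta)/\alpha)$ relative to its value at rank 0. The sketch below is an assumption-laden illustration (the helper `relative_weight` is hypothetical and only covers $\rho_1$), not part of the authors' code.

```python
import math

def drho1(x):
    # Derivative of rho_1(x) = log2(1 + x)
    return 1.0 / ((1.0 + x) * math.log(2.0))

def relative_weight(rank, alpha, beta):
    # rho_1'((rank + beta) / alpha), normalized by its value at rank 0
    return drho1((rank + beta) / alpha) / drho1(beta / alpha)

settings = [(1.0, 0.0), (100.0, 99.0)]
table = {
    (alpha, beta): [relative_weight(k, alpha, beta) for k in (0, 10, 100, 1000)]
    for alpha, beta in settings
}
for (alpha, beta), row in table.items():
    print(alpha, beta, [round(v, 4) for v in row])
```

With $\alpha = 1$, $\beta = 0$ the weight at rank 1000 is 1/1001 of the weight at rank 0, whereas with $\alpha = 100$, $\beta = 99$ it is still about one sixth, so the second setting gives up on low-ranked contexts far more gradually.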




















3 Related Work 

Our work sits at the intersection of word embedding and ranking optimization. As we discussed in Sec. 2.2 and Sec. 2.5, it is also related to the attention mechanism widely used in deep learning. We therefore review the related work along these three axes.

Word Embedding. We already discussed some related work (word2vec and GloVe) on word embedding in the introduction. Essentially, word2vec and GloVe derive word representations by modeling a transformation (PMI or log) of $X_{w,c}$ directly, while WordRank learns word representations via robust ranking. Besides these state-of-the-art techniques, a few ranking-based approaches have been proposed for word embedding recently, e.g., (Collobert and Weston, 2008; Vilnis and McCallum, 2015; Liu et al., 2015). However, all of them adopt a pair-wise binary classification approach with a linear ranking loss $\rho_0$. For example, (Collobert and Weston, 2008; Vilnis and McCallum, 2015) employ a hinge loss on positive/negative word pairs to learn word representations, and $\rho_0$ is used implicitly to evaluate ranking losses. As we discussed in Sec. 2.2, $\rho_0$ has no benefit of the attention mechanism and robustness to noise since its linearity treats all the ranking errors uniformly; empirically, sub-optimal performance is often observed with $\rho_0$ in our experiments. More recently, by extending the Skip-Gram model of word2vec, Liu et al. (2015) incorporate additional pair-wise constraints induced from third-party knowledge bases, such as WordNet, and learn word representations jointly. In contrast, WordRank is a fully ranking-based approach without using any additional data source for training.

Robust Ranking. The second line of work that is very relevant to WordRank is that of ranking objective (3). The use of score functions $\langle \mathbf{u}_w, \mathbf{v}_c \rangle$ for ranking is inspired by the latent collaborative retrieval framework of Weston et al. (2012). Writing the rank as a sum of indicator functions (1), and upper bounding it via a convex loss (2), is due to Usunier et al. (2009). Using $\rho_0(\cdot)$ (5) corresponds to the well-known pairwise ranking loss (see e.g., (Lee and Lin, 2013)). On the other hand, Yun et al. (2014) observed that if they set $\rho = \rho_2$ as in (7), then $-J(\mathbf{U}, \mathbf{V})$ corresponds to the DCG (Discounted Cumulative Gain), one of the most popular ranking metrics used in web search ranking (Manning et al., 2008). In their RobiRank algorithm they proposed the use of $\rho = \rho_1$, which they considered to be a special function for which one can derive an efficient stochastic optimization procedure. However, as we showed in this paper, the general class of monotonically increasing concave functions can be handled efficiently. Another important difference of our approach is the hyperparameters $\alpha$ and $\beta$, which we use to modify the behavior of $\rho$, and which we find are critical to achieve good empirical results. Ding and Vishwanathan (2010) proposed the use of $\rho = \log_t$ in the context of robust binary classification, while here we are concerned with ranking, and our formulation is very general and applies to a variety of ranking losses $\rho(\cdot)$ with weights $r_{w,c}$. Optimizing over $\mathbf{U}$ and $\mathbf{V}$ by distributing the computation across processors is inspired by work on distributed stochastic gradient for matrix factorization (Gemulla et al., 2011).

Attention. Attention is one of the most important advancements in deep learning in recent years (Larochelle and Hinton, 2010), and is now widely used in state-of-the-art image recognition and machine translation systems (Mnih et al., 2014; Bahdanau et al., 2015). Recently, attention has also been applied to the domain of word embedding. For example, under the intuition that not all contexts are created equal, Wang et al. (2015) assign an importance weight to each word type at each context position and learn an attention-based Continuous Bag-Of-Words (CBOW) model. Similarly, within a ranking framework, WordRank expresses the context importance by introducing the auxiliary variable $\xi_{w,c}$, which "gives up" on contexts with high ranks in order to focus its attention on the top of the list.


4 Experiments

In our experiments, we first evaluate the impact of the weight $r_{w,c}$ and the ranking loss function $\rho(\cdot)$ on the test performance using a small dataset. We then pick the best performing model and compare it against word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014). We closely follow the framework of Levy et al. (2015) to set up a careful and fair comparison of the three methods. Our code is publicly available at https://bitbucket.org/shihaoji/wordrank.



Training Corpus. Models are trained on a combined corpus of 7.2 billion tokens, which consists of the 2015 Wikipedia dump with 1.6 billion tokens, the WMT14 News Crawl^5 with 1.7 billion tokens, the "One Billion Word Language Modeling Benchmark"^6 with almost 1 billion tokens, and the UMBC webbase corpus^7 with around 3 billion tokens. The pre-processing pipeline breaks the paragraphs into sentences, then tokenizes and lowercases each corpus with the Stanford tokenizer. We further clean up the dataset by removing non-ASCII characters and punctuation, and discard sentences that are shorter than 3 tokens or longer than 500 tokens. In the end, we obtain a dataset of 7.2 billion tokens, with the first 1.6 billion tokens from Wikipedia. When we want to experiment with a smaller corpus, we extract a subset which contains the specified number of tokens.
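The clean-up rules in this paragraph can be sketched as a small filter. This is a rough approximation under explicit assumptions: it uses whitespace splitting in place of the Stanford tokenizer, takes one already-segmented sentence at a time, and the function name `clean_sentence` is hypothetical.

```python
import string

def clean_sentence(sentence, min_len=3, max_len=500):
    # Lowercase, then drop non-ASCII characters and ASCII punctuation.
    text = sentence.lower()
    text = "".join(ch for ch in text if ord(ch) < 128 and ch not in string.punctuation)
    # Whitespace tokenization stands in for the Stanford tokenizer (assumption).
    tokens = text.split()
    # Discard sentences shorter than 3 tokens or longer than 500 tokens.
    return tokens if min_len <= len(tokens) <= max_len else None

kept = clean_sentence("The caf\u00e9 is open, today!")
dropped = clean_sentence("Hi there!")
```

Here `kept` retains five tokens (the accented character is removed rather than transliterated), while `dropped` is `None` because only two tokens remain.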


Co-occurrence matrix construction. We use the GloVe code to construct the co-occurrence matrix $X$, and the same matrix is used to train GloVe and WordRank models. When constructing $X$, we must choose the size of the vocabulary, the context window and whether to distinguish left context from right context. We follow the findings and design choices of GloVe and use a symmetric window of size win with a decreasing weighting function, so that word pairs that are $d$ words apart contribute $1/d$ to the total count. Specifically, when the corpus is small (e.g., 17M, 32M, 64M) we let win = 15 and for larger corpora we let win = 10. The larger window size alleviates the data sparsity issue for small corpora at the expense of adding more noise to $X$. The parameter settings used in our experiments are summarized in Table 1.
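The symmetric window with $1/d$ weighting can be sketched as follows. This is a simplified stand-in for the GloVe co-occurrence tool, not its actual code: it assumes windows do not cross sentence boundaries, keeps every word (no vocabulary cut-off), and the function name `cooccurrence` is hypothetical.

```python
from collections import defaultdict

def cooccurrence(sentences, win):
    # X[(w, c)] accumulates 1/d for every pair of tokens that are d <= win apart.
    X = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for d in range(1, win + 1):
                j = i + d
                if j >= len(tokens):
                    break
                X[(w, tokens[j])] += 1.0 / d
                X[(tokens[j], w)] += 1.0 / d  # symmetric window: count both directions
    return X

X = cooccurrence([["a", "b", "c"]], win=2)
```

In this toy sentence, adjacent words contribute 1 to each other's counts and the pair ("a", "c"), which is two positions apart, contributes 0.5.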


Using the trained model. It has been shown by Pennington et al. (2014) that combining the $\mathbf{u}_w$ and $\mathbf{v}_c$ vectors with equal weights gives a small boost in performance. This vector combination was originally motivated as an ensemble method (Pennington et al., 2014), and later Levy et al. (2015) provided a different interpretation of its effect on the cosine similarity function, and showed that adding context vectors effectively adds first-order similarity terms to the second-order similarity function. In our experiments, we find that vector combination boosts the performance on the word analogy task when the training set is small, but when the dataset is large enough (e.g., 7.2 billion tokens), vector combination doesn't help anymore. More interestingly, for the word similarity task, we find that vector combination is detrimental in all cases, sometimes even substantially.^8 Therefore, we will always use $\mathbf{u}_w$ on the word similarity task, and use $\mathbf{u}_w + \mathbf{v}_c$ on the word analogy task unless otherwise noted.

^5 http://www.statmt.org/wmt14/translation-task.html

^6 http://www.statmt.org/lm-benchmark

^7 http://ebiquity.umbc.edu/resource/html/id/351


4.1 Evaluation 


Word Similarity. We use six datasets to evaluate word similarity: WS-353 (Finkelstein et al., 2002) partitioned into two subsets: WordSim Similarity and WordSim Relatedness (Agirre et al., 2009); MEN (Bruni et al., 2012); Mechanical Turk (Radinsky et al., 2011); Rare words (Luong et al., 2013); and SimLex-999 (Hill et al., 2014). They contain word pairs together with human-assigned similarity judgments. The word representations are evaluated by ranking the pairs according to their cosine similarities, and measuring the Spearman's rank correlation coefficient with the human judgments.
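The evaluation protocol just described can be sketched in plain Python: compute cosine similarities for each pair and correlate their ranks with the human scores. The embedding values and human scores below are invented for illustration, the helper names are hypothetical, and every word is assumed to be in the vocabulary.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate_similarity(emb, pairs):
    # pairs: (word1, word2, human_score)
    model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
    human = [s for _, _, s in pairs]
    return spearman(model, human)

emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "truck": [0.3, 0.9]}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0)]
score = evaluate_similarity(emb, pairs)
```

Because only the ordering of the pairs matters, any monotone rescaling of the model's similarity scores leaves the result unchanged, which is why this metric pairs naturally with a ranking objective.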


Word Analogies. For this task, we use the Google analogy dataset (Mikolov et al., 2013a). It contains 19544 word analogy questions, partitioned into 8869 semantic and 10675 syntactic questions. A question is correctly answered only if the algorithm selects the word that is exactly the same as the correct word in the question: synonyms are thus counted as mistakes. There are two ways to answer these questions, namely, by using 3CosAdd or 3CosMul (see (Levy and Goldberg, 2014) for details). We will report scores by using 3CosAdd by default, and indicate when 3CosMul gives better performance.
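As a rough illustration of the two answering rules, the sketch below scores each candidate word with 3CosAdd or 3CosMul, following the formulation of Levy and Goldberg (2014) as we understand it (cosines shifted to [0, 1] and a small constant in the denominator for 3CosMul). The toy embedding, the value of `eps` and the function name `answer` are assumptions made for this example.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(emb, a, b, c, method="add", eps=1e-3):
    # Answer the query a:b;c:? by scoring every word except the query words.
    best, best_score = None, -math.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sa, sb, sc = cosine(vec, emb[a]), cosine(vec, emb[b]), cosine(vec, emb[c])
        if method == "add":
            score = sb - sa + sc                      # 3CosAdd
        else:
            sa, sb, sc = (sa + 1) / 2, (sb + 1) / 2, (sc + 1) / 2
            score = sb * sc / (sa + eps)              # 3CosMul
        if score > best_score:
            best, best_score = word, score
    return best

# Toy 3-dimensional embedding (made up): axes ~ royalty, gender, person-ness.
emb = {
    "king": [1.0, 0.5, 1.0], "queen": [1.0, -0.5, 1.0],
    "man": [0.0, 0.5, 1.0], "woman": [0.0, -0.5, 1.0],
    "apple": [0.0, 0.0, -1.0],
}
add_answer = answer(emb, "king", "queen", "man", method="add")
mul_answer = answer(emb, "king", "queen", "man", method="mul")
```

On this toy vocabulary both rules return "woman"; on real embeddings the two rules can disagree, which is why the text reports which one is used.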


^8 This is possible since we optimize a ranking loss: the absolute scores don't matter as long as they yield an ordered list correctly. Thus, WordRank's $\mathbf{u}_w$ and $\mathbf{v}_c$ are less comparable to each other than those generated by GloVe, which employs a point-wise $L_2$ loss.






























Table 1: Parameter settings used in the experiments.

Corpus Size            17M*   32M    64M    128M   256M   512M   1.0B   1.6B   7.2B
Vocabulary Size |W|    71K    100K   100K   200K   200K   300K   300K   400K   620K
Window Size win        15     15     15     10     10     10     10     10     10
Dimension k            100    100    100    200    200    300    300    300    300

* This is the Text8 dataset from http://mattmahoney.net/dc/text8.zip, which is widely used for word embedding demos.

Table 2: Performance of different $\rho$ functions on the Text8 dataset with 17M tokens.

Task         Robi    rho_0            rho_1            rho_2            rho_3
                     off     on       off     on       off     on       off     on
Similarity   41.2    69.0    _71.0_   66.7    70.4     66.8    70.8     68.1    68.0
Analogy      22.7    24.9    31.9     34.3    _44.5_   32.3    40.4     33.6    42.9


Handling questions with out-of-vocabulary words. Some papers (e.g., (Levy et al., 2015)) filter out questions with out-of-vocabulary words when reporting performance. By contrast, in our experiments if any word of a question is out of vocabulary, the corresponding question will be marked as unanswerable and will get a score of zero. This decision is made so that when the size of the vocabulary increases, the model performance is still comparable across different experiments.
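The difference between the two scoring conventions can be made explicit with a short sketch. The question list, the vocabulary and the stand-in `predict` function are invented for illustration; only the scoring rule (out-of-vocabulary questions count as wrong) follows the text.

```python
def analogy_accuracy(questions, vocab, predict, drop_oov=False):
    # Each question is a tuple (a, b, c, gold).
    answerable = [q for q in questions if all(w in vocab for w in q)]
    correct = sum(1 for a, b, c, gold in answerable if predict(a, b, c) == gold)
    # drop_oov=False: unanswerable questions stay in the denominator (score of zero).
    total = len(answerable) if drop_oov else len(questions)
    return correct / total

vocab = {"king", "queen", "man", "woman", "paris", "france", "rome", "italy"}
questions = [
    ("king", "queen", "man", "woman"),
    ("paris", "france", "rome", "italy"),
    ("man", "woman", "king", "queen"),
    ("paris", "france", "oslo", "norway"),   # contains out-of-vocabulary words
]
# Stand-in model: a fixed lookup that gets two of the three answerable questions right.
lookup = {
    ("king", "queen", "man"): "woman",
    ("paris", "france", "rome"): "italy",
    ("man", "woman", "king"): "princess",
}

def predict(a, b, c):
    return lookup.get((a, b, c))

strict = analogy_accuracy(questions, vocab, predict)
filtered = analogy_accuracy(questions, vocab, predict, drop_oov=True)
```

The strict score stays tied to the full question set, so models trained with different vocabulary sizes are measured against the same denominator, whereas the filtered score changes its denominator with the vocabulary.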


4.2 The impact of $r_{w,c}$ and $\rho(\cdot)$

In Sec. 2.2 we argued the need for adding the weight $r_{w,c}$ to ranking objective (3), and we also presented our framework which can deal with a variety of ranking loss functions $\rho$. We now study the utility of these two ideas. We report results on the 17 million token dataset in Table 2. For the similarity task, we use the WS-353 test set and for the analogy task we use the Google analogy test set. The best scores for each task are underlined. We set $t = 1.5$ for $\rho_3$. "Off" means that we used uniform weight $r_{w,c} = 1$, and "on" means that $r_{w,c}$ was set as in (4). For comparison, we also include the results using RobiRank (Yun et al., 2014).^9

It can be seen from Table 2 that adding the weight $r_{w,c}$ improves performance in all cases but one ($\rho_3$ on the similarity task, where the two scores are essentially tied), especially on the word analogy task. Among the four $\rho$ functions, $\rho_0$ performs the best on the word similarity task but suffers notably on the analogy task, while $\rho_1 = \log$ performs the best overall. Given these observations, which are consistent with the results on large scale datasets, in the experiments that follow we only report WordRank with the best configuration, i.e., using $\rho_1$ with the weight $r_{w,c}$ as defined in (4).

^9 We used the code provided by the authors at https://bitbucket.org/d_ijk_stra/robirank. Although related to RobiRank, we attribute the superior performance of WordRank to the use of weight $r_{w,c}$ (4), introduction of hyperparameters $\alpha$ and $\beta$, and many implementation details.


4.3 Comparison to state-of-the-arts 

In this section we compare the performance of Wor¬ 
dRank with wordlve^^ and by using the 

code provided by the respective authors. For a fair 
comparison, GloVe and WordRank are given as in¬ 
put the same co-occurrence matrix X; this elimi¬ 
nates differences in performance due to window size 
and other such artifacts, and the same parameters 
are used to wordlvec. Moreover, the embedding di¬ 
mensions used for each of the three methods is the 
same (see Table [T}. With wordlvec, we train the 
Skip-Gram with Negative Sampling (SGNS) model 
since it produces state-of-the-art performance, and 
is widely used in the NLP community ( |Mikolov~H 
ah, 2013bl l. For GloVe, we use the default parame¬ 
ters as suggested by ( [Pennington et ah, 2014| ). The 
results are provided in Figure (also see Table in 
the supplementary material for additional details). 

As can be seen, when the size of corpus increases, 
in general all three algorithms improve their predic¬ 
tion accuracy on both tasks. This is to be expected 
since a larger corpus typically produces better statis¬ 
tics and less noise in the co-occurrence matrix X. 
When the corpus size is small {e.g., 17M, 32M, 
64M, 128M), WordRank yields the best performance 
with significant margins among three, followed by 
wordlvec and GloVe', when the size of corpus in¬ 
creases further, on the word analogy task wordlvec 

**‘https : //code . google . cora/p/word2vec/ 

'http://nlp.stanford.edu/projects/glove 















































Figure 2: Performance evolution as a function of corpus size (a) on WS-353 word similarity benchmark; (b) on Google word 
analogy benchmark. 

Table 3: Performance of the best word2vec, GloVe and WordRank models, learned from 7.2 billion tokens, on six similarity tasks 
and Google semantic and syntactic subtasks. 



          Word Similarity                                                        Word Analogy
Model     WordSim     WordSim      Bruni et   Radinsky    Luong et  Hill et al.  Goog   Goog
          Similarity  Relatedness  al. MEN    et al. MT   al. RW    SimLex       Sem.   Syn.
word2vec  73.9        60.9         75.4       66.4        45.5      36.6         78.8   72.0
GloVe     75.7        67.5         78.8       69.7        43.6      41.6         80.9   71.1
WordRank  79.4        70.5         78.1       73.5        47.4      43.5         78.4   74.7


and GloVe become very competitive with WordRank,
and eventually the three perform neck and neck
(Figure 2(b)). This is consistent with the findings of
Levy et al. (2015), indicating that when the number
of tokens is large even simple algorithms can perform
well. On the other hand, WordRank is dominant
on the word similarity task in all cases
(Figure 2(a)) since it optimizes a ranking loss explicitly,
which aligns more naturally with the objective
of word similarity than the other methods; with 17
million tokens our method performs almost as well
as existing methods using 7.2 billion tokens on the
word similarity benchmark.
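
To make the link between a ranking objective and the similarity benchmark concrete, the following minimal sketch scores an embedding by the rank correlation between its cosine similarities and human judgments. Spearman's correlation is assumed here as the metric (it is the customary choice for benchmarks such as WS-353); the toy vectors, the human scores and all function names are illustrative and are not taken from the WordRank code.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similarity_score(embedding, pairs):
    """pairs: list of (word1, word2, human_score) triples."""
    model = [cosine(embedding[a], embedding[b]) for a, b, _ in pairs]
    human = [score for _, _, score in pairs]
    return spearman(model, human)

# Hypothetical toy data, for illustration only.
toy_embedding = {
    "cat": [1.0, 0.0], "kitten": [0.9, 0.1],
    "car": [0.0, 1.0], "truck": [0.2, 0.8],
}
toy_pairs = [("cat", "kitten", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0)]
toy_score = similarity_score(toy_embedding, toy_pairs)
```

Only the ordering of the model's similarities matters to this metric, not their magnitudes, which is the sense in which the evaluation is a ranking problem.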

As a side note, on a similar 1.6-billion-token
Wikipedia corpus, our word2vec and GloVe performance
scores are somewhat better than or close to
the results reported by Pennington et al. (2014); and
our word2vec and GloVe scores on the 7.2-billion-token
dataset are close to what they reported on a
42-billion-token dataset. We believe this discrepancy
is primarily due to the extra attention we paid
to pre-processing the Wikipedia and other corpora.


To further evaluate the model performance on the
word similarity/analogy tasks, we use the best performing
models trained on the 7.2-billion-token corpus
to predict on the six word similarity datasets described
in Sec. 4.1. Moreover, we break down the
performance of the models on the Google word analogy
dataset into the semantic and syntactic subtasks.
Results are listed in Table 3. As can be seen, WordRank
outperforms word2vec and GloVe on 5 of 6
similarity tasks, and 1 of 2 Google analogy subtasks.


5 Visualizing the results

To understand whether WordRank produces a syntactically
and semantically meaningful vector space,
we performed the following experiment: we use the best
performing model produced using 7.2 billion tokens,
and compute the nearest neighbors of the word
“cat”. We then visualize the words in two dimensions
by using t-SNE (Maaten and Hinton, 2008).
As can be seen in Figure 3, our ranking-based model
is indeed capable of capturing both semantic (e.g.,
cat, feline, kitten, tabby) and syntactic (e.g., leash,
leashes, leashed) regularities of the English language.

Figure 3: Nearest neighbors of “cat” found by projecting a
300d word embedding learned from WordRank onto a 2d space.

6 Conclusion

We proposed WordRank, a ranking-based approach
to learn word representations from large scale textual
corpora. The most prominent difference between
our method and the state-of-the-art techniques,
such as word2vec and GloVe, is that WordRank
learns word representations via a robust ranking
model, while word2vec and GloVe typically
model a transformation of the co-occurrence count
$X_{w,c}$ directly. Moreover, through a ranking loss
function $\rho(\cdot)$, WordRank achieves its attention
mechanism and robustness to noise naturally, which
are usually lacking in other ranking-based approaches.
These attributes significantly boost the performance of
WordRank in the cases where training data are sparse and
noisy. Our multi-node distributed implementation of
WordRank is publicly available for general usage.
WordRank: Learning Word Embeddings via Robust Ranking

(Supplementary Material)

A Parallel WordRank


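Given $p$ workers, we partition the words $W$ into $p$ parts
$\{W^{(1)}, W^{(2)}, \ldots, W^{(p)}\}$ such that
they are mutually exclusive, exhaustive and approximately
equal-sized. This partition on $W$ induces
a partition on $U$, $\Omega$ and $\Xi$ as follows:
$U^{(q)} := \{u_w\}_{w \in W^{(q)}}$,
$\Omega^{(q)} := \{(w, c) \in \Omega\}_{w \in W^{(q)}}$, and
$\Xi^{(q)} := \{\xi_{w,c}\}_{(w,c) \in \Omega^{(q)}}$
for $1 \le q \le p$. When the algorithm starts,
$U^{(q)}$, $\Omega^{(q)}$ and $\Xi^{(q)}$ are distributed
to worker $q$.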

At the beginning of each outer iteration,
an approximately equal-sized partition
$\{C^{(1)}, C^{(2)}, \ldots, C^{(p)}\}$ on the context set $C$ is sampled;
note that this is independent of the partition on
words $W$. This induces a partition
$\{V^{(1)}, V^{(2)}, \ldots, V^{(p)}\}$ on context
vectors defined as follows:
$V^{(q)} := \{v_c\}_{c \in C^{(q)}}$ for each $q$. Then, each
$V^{(q)}$ is distributed to worker $q$. Now we define

$$
J^{(q)}\left(U^{(q)}, V^{(q)}, \Xi^{(q)}\right) :=
\sum_{(w,c) \in \Omega \cap \left(W^{(q)} \times C^{(q)}\right)}
\; \sum_{c' \in C^{(q)} \setminus \{c\}} j(w, c, c'),
\qquad (15)
$$

where $j(w, c, c')$ was defined earlier. Note that
$j(w, c, c')$ in the above equation only accesses $u_w$,
$v_c$ and $v_{c'}$, which belong to no sets other than
$U^{(q)}$ and $V^{(q)}$; therefore worker $q$ can run stochastic
gradient descent updates on (15) for a predefined
amount of time without having to communicate with
other workers. The pseudo-code is illustrated in
Algorithm 2.
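
The partitioning scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the distributed implementation itself: the helper names `partition` and `strata`, the toy vocabulary and the toy observation set `omega` are all hypothetical. It keeps the word partition fixed, resamples the context partition at each outer iteration, and builds the set $\Omega \cap (W^{(q)} \times C^{(q)})$ that each worker is allowed to touch.

```python
import random

def partition(items, p, rng=None):
    """Split items into p disjoint, exhaustive, near-equal parts.

    If rng is given, the items are shuffled first, which gives a freshly
    sampled partition (used for contexts at each outer iteration).
    """
    items = list(items)
    if rng is not None:
        rng.shuffle(items)
    return [set(items[q::p]) for q in range(p)]

def strata(omega, word_parts, context_parts):
    """Observations visible to worker q: Omega ∩ (W^(q) x C^(q))."""
    return [
        {(w, c) for (w, c) in omega if w in wq and c in cq}
        for wq, cq in zip(word_parts, context_parts)
    ]

# Hypothetical toy problem.
words = ["w0", "w1", "w2", "w3"]
contexts = ["c0", "c1", "c2", "c3", "c4", "c5"]
omega = {(w, c) for w in words for c in contexts
         if (int(w[1]) + int(c[1])) % 2 == 0}

p = 2
word_parts = partition(words, p)          # fixed for the whole run
rng = random.Random(0)
for outer_iteration in range(3):
    context_parts = partition(contexts, p, rng)   # resampled every iteration
    local = strata(omega, word_parts, context_parts)
    # Worker q would now run SGD using only local[q], U^(q) and V^(q).
```

Because the word parts are disjoint and the context parts are disjoint, no two workers ever read or write the same $u_w$ or $v_c$ within an outer iteration, which is why no communication is needed during the inner loop.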

Considering that the scope of each worker is always
confined to a rather narrow set of observations
$\Omega \cap (W^{(q)} \times C^{(q)})$, it is somewhat surprising
that Gemulla et al. (2011) proved that such an optimization
scheme, which they call stratified stochastic
gradient descent (SSGD), converges to the same
local optimum a vanilla SGD would converge to.
This is due to the fact that

$$
\mathbb{E}\left[ J^{(1)}\left(U^{(1)}, V^{(1)}, \Xi^{(1)}\right)
+ J^{(2)}\left(U^{(2)}, V^{(2)}, \Xi^{(2)}\right) + \cdots
+ J^{(p)}\left(U^{(p)}, V^{(p)}, \Xi^{(p)}\right) \right]
\approx J(U, V, \Xi),
\qquad (16)
$$

if the expectation is taken over the sampling of the
partitions of $C$. This implies that the bias in each
iteration due to the narrowness of the scope will be
washed out in the long run; this observation leads to
the proof of convergence in Gemulla et al. (2011) using
standard theoretical results from Yin and Kushner (2003).


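Algorithm 2 Distributed WordRank algorithm.
$\eta$: step size
repeat
    // Start outer iteration
    Sample a partition over contexts $\{C^{(1)}, \ldots, C^{(p)}\}$
    // Step 1: Update U, V in parallel
    for all machine $q \in \{1, \ldots, p\}$ do in parallel
        Fetch all $v_c \in V^{(q)}$
        repeat
            Sample $(w, c)$ uniformly from $\Omega \cap (W^{(q)} \times C^{(q)})$
            Sample $c'$ uniformly from $C^{(q)} \setminus \{c\}$
            // following three updates are done simultaneously
            $u_w \leftarrow u_w - \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle u_w, v_c - v_{c'} \rangle) \cdot (v_c - v_{c'})$
            $v_c \leftarrow v_c - \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle u_w, v_c - v_{c'} \rangle) \cdot u_w$
            $v_{c'} \leftarrow v_{c'} + \eta \cdot r_{w,c} \cdot \rho'(\xi_{w,c}^{-1}) \cdot \ell'(\langle u_w, v_c - v_{c'} \rangle) \cdot u_w$
        until predefined time limit is exceeded
    end for
    // Step 2: Update $\Xi$ in parallel
    for all machine $q \in \{1, \ldots, p\}$ do in parallel
        Fetch all $v_c \in V$
        for $w \in W^{(q)}$ do
            for $c \in C$ do
                $\xi_{w,c} = \alpha / \left( \sum_{c' \in C \setminus \{c\}} \ell(\langle u_w, v_c - v_{c'} \rangle) + \beta \right)$
            end for
        end for
    end for
until U, V and $\Xi$ are converged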
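
The two inner steps of Algorithm 2 can be written out as a small single-machine sketch. Several choices below are assumptions made only so that the example runs: the surrogate loss is taken to be the logistic loss $\ell(x) = \log_2(1 + 2^{-x})$, the rank transform is taken to be $\rho_1(x) = \log_2(1 + x)$, and the function names and the numeric values of $\alpha$, $\beta$ and $\eta$ are hypothetical. It should be read as an illustration of the update rules, not as the released implementation.

```python
from math import log

LN2 = log(2.0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ell(x):
    """Assumed logistic surrogate of the 0-1 loss: log2(1 + 2^-x)."""
    return log(1.0 + 2.0 ** (-x)) / LN2

def ell_prime(x):
    """Derivative of ell; always negative."""
    return -1.0 / (1.0 + 2.0 ** x)

def rho_prime(x):
    """Derivative of the assumed rank transform rho_1(x) = log2(1 + x)."""
    return 1.0 / ((1.0 + x) * LN2)

def sgd_step(u_w, v_c, v_cp, xi_wc, r_wc, eta):
    """One simultaneous update of u_w, v_c and v_c' for a sampled (w, c, c')."""
    diff = [a - b for a, b in zip(v_c, v_cp)]
    # Common scalar factor shared by the three updates.
    g = eta * r_wc * rho_prime(1.0 / xi_wc) * ell_prime(dot(u_w, diff))
    # All right-hand sides use the old values, so the updates are simultaneous.
    new_u = [a - g * d for a, d in zip(u_w, diff)]
    new_vc = [a - g * b for a, b in zip(v_c, u_w)]
    new_vcp = [a + g * b for a, b in zip(v_cp, u_w)]
    return new_u, new_vc, new_vcp

def update_xi(u_w, c, V, alpha, beta):
    """Step 2: xi_{w,c} = alpha / (sum_{c' != c} ell(<u_w, v_c - v_c'>) + beta)."""
    total = sum(
        ell(dot(u_w, [a - b for a, b in zip(V[c], V[cp])]))
        for cp in V if cp != c
    )
    return alpha / (total + beta)
```

Since $\ell' < 0$, each step moves $u_w$ toward $v_c - v_{c'}$ and pushes $v_c$ and $v_{c'}$ apart along $u_w$, which raises the score of the observed context $c$ relative to the sampled context $c'$.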
















Table 4: Performance of word2vec, GloVe and WordRank on datasets with increasing sizes; evaluated on WS-353 word similarity
benchmark and Google word analogy benchmark.

              WS-353 (Word Similarity)          Google (Word Analogy)
Corpus Size   word2vec  GloVe   WordRank        word2vec  GloVe   WordRank
17M           66.8      47.8    70.4            39.2      30.4    44.5
32M           64.1      47.8    68.4            42.3      30.9    52.1
64M           67.5      55.0    70.8            53.5      42.0    59.9
128M          70.7      54.5    72.8            59.8      50.4    65.1
256M          72.0      59.5    72.4            67.6      60.3    68.6
512M          113       64.5    74.1            70.6      66.4    70.6
1.0B          73.3      68.3    74.0            70.4      68.7    70.8
1.6B          71.8      69.5    74.1            72.1      70.4    71.7
7.2B          73.4      70.9    75.2/77.4†      75.1†     75.6†   76.0††

† When $\rho_0$ is used, corresponding to setting $\xi = 1$ in training and no $\xi$ update
† Use 3CosMul instead of regular 3CosAdd for evaluation
† Use $u_w$ instead of default $u_w + v_c$ as word representation for evaluation
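
The two analogy scoring rules named in the table notes can be illustrated as follows. This is a minimal sketch, assuming the usual definitions of 3CosAdd and 3CosMul from the word analogy literature (with cosines shifted into [0, 1] for the multiplicative form); the function name `answer_analogy`, the smoothing constant `eps` and the toy vectors are hypothetical and are not taken from the evaluation scripts used in the paper.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def answer_analogy(emb, a, a_star, b, method="3CosAdd", eps=1e-3):
    """Answer 'a is to a_star as b is to ?' over the vocabulary of emb."""
    best_word, best_score = None, float("-inf")
    for word, vec in emb.items():
        if word in (a, a_star, b):
            continue  # the three query words are excluded from the candidates
        s_a = cosine(vec, emb[a])
        s_as = cosine(vec, emb[a_star])
        s_b = cosine(vec, emb[b])
        if method == "3CosAdd":
            score = s_as - s_a + s_b
        else:
            # 3CosMul: shift cosines into [0, 1], then multiply and divide.
            s_a, s_as, s_b = (s_a + 1) / 2, (s_as + 1) / 2, (s_b + 1) / 2
            score = s_as * s_b / (s_a + eps)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical toy vectors, for illustration only.
toy = {
    "man": [0.0, 1.0, 0.0], "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0], "queen": [1.0, 0.0, 1.0],
    "apple": [-1.0, 0.2, 0.1],
}
add_answer = answer_analogy(toy, "man", "woman", "king", method="3CosAdd")
mul_answer = answer_analogy(toy, "man", "woman", "king", method="3CosMul")
```

The multiplicative form keeps any single large cosine term from dominating the score, which is one reason it can give different answers from the additive form on real embeddings.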


B Additional Experimental Details 

Table 4 is the tabular view of the data plotted in Figure 2;
it provides additional experimental details.





















